Router-Expert Alignment Analysis

This report performs SVD-based alignment analysis between router vectors and expert weight matrices.

Plot Explanations

The following section explains what each plot type shows and how it is computed. All plots use layer numbers (e.g., L5, L10) in their legends, without timestamps.

Comparison Plots

Comparison Plots: This figure contains four subplots comparing multiple analysis runs:

Cos²(θ) Expert Comparison

Cos²(θ) Expert Comparison: This figure compares cos²(θ) values across experts and layers (for k=1 only):

Shuffle Statistics

Shuffle Statistics: This figure shows statistics from shuffled baseline comparisons:

Z-score Decomposition

Z-score Decomposition: This figure breaks down the z-score calculation into its components:

Distribution Comparison

Distribution Comparison: This plot shows the probability distribution of shuffled alignments (projection energies) compared to the true alignment value:
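As a concrete illustration, the shuffled-baseline distribution behind this plot can be sketched as follows. This is a minimal sketch, not the report's actual implementation: the `align_fn` interface, the matched router-expert pairing, and the shuffle count are assumptions.

```python
import numpy as np

def shuffle_zscore(align_fn, routers, experts, n_shuffles=1000, seed=0):
    """Compare the true router-expert alignment to a shuffled baseline.

    align_fn(r, e) returns a scalar alignment (e.g. projection energy)
    for router vector r and expert weights e (hypothetical interface).
    Returns (true alignment, shuffled mean, shuffled std, z-score).
    """
    rng = np.random.default_rng(seed)
    true_align = np.mean([align_fn(r, e) for r, e in zip(routers, experts)])
    shuffled = []
    for _ in range(n_shuffles):
        perm = rng.permutation(len(experts))  # break the router-expert pairing
        shuffled.append(np.mean([align_fn(r, experts[j])
                                 for r, j in zip(routers, perm)]))
    mu, sigma = np.mean(shuffled), np.std(shuffled)
    z = (true_align - mu) / sigma  # z-score of the true value vs. the baseline
    return true_align, mu, sigma, z
```

The histogram in the plot is simply `shuffled`, with `true_align` drawn as a vertical line.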

Per-Expert Breakdown

Per-Expert Breakdown: This figure provides detailed expert-level analysis:

Complete Analysis Visualization

Complete Analysis Visualization: This comprehensive figure contains 12 subplots covering all key metrics for a single analysis run. All metrics are computed as described in the individual plot explanations above.

Ambiguity Score Analysis

Per-Layer Ambiguity Score Analysis: This analysis combines multiple metrics to identify layers with potential routing instability and load imbalance.

Inter-Expert Orthogonality

Inter-Expert Orthogonality: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.

Unique Identification Analysis

Unique Identification Analysis: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. It goes beyond shuffled-baseline tests to assess whether routing vectors encode expert identity in a separable and discriminative manner.

Interpretation Summary:

1. Comparison Across All Runs

This section compares all 6 result files side by side.

Comparison Plots

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Unique Identification Comparison

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Cos²(θ) Expert Comparison

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Diagnostic Plots (Comparison)

Diagnostic plots comparing all runs to understand differences in alignment, z-score, and delta.

Shuffle Statistics

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Z-score Decomposition

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Distribution Comparison (k=32)

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Distribution Comparison (k=128)

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Distribution Comparison (k=512)

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Distribution Comparison (k=2048)

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Per-Expert Breakdown

See "Plot Explanations" section at the top of this report for detailed information about this plot.


Inter-Expert Orthogonality Analysis (Comparison)

This section compares expert-to-expert orthogonality across all runs. Lower off-diagonal similarity indicates better expert diversity.

Inter-Expert Orthogonality - L0

Statistics:

  • Mean off-diagonal similarity: -0.0603
  • Mean absolute off-diagonal similarity: 0.3233
  • Max off-diagonal similarity: 0.8906
Inter-Expert Orthogonality: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
  • Method:
    • Step 1 - Extract Principal Directions: For each expert in the layer, extract its top singular vector (v₁ = V[:, 0]) from the SVD decomposition. This vector represents the expert's "main direction" or principal component.
    • Step 2 - Compute Similarity Matrix: Calculate the cosine similarity between every pair of experts' top singular vectors. The result is a [n_experts, n_experts] matrix where entry (i, j) is the cosine similarity between expert i and expert j's principal directions.
    • Formula: similarity[i, j] = (v₁ᵢ · v₁ⱼ) / (||v₁ᵢ|| · ||v₁ⱼ||) = v₁ᵢ · v₁ⱼ (since vectors are normalized), where v₁ᵢ is the top singular vector of expert i.
  • Visualization:
    • Heatmap: The similarity matrix is displayed as a heatmap with expert indices on both axes.
    • Colormap: Uses 'coolwarm' diverging colormap showing signed cosine similarity: blue (-1 = opposite directions) → white (0 = orthogonal) → red (+1 = same direction). The diagonal always shows 1.0 (experts are perfectly similar to themselves).
    • Grid: Grid lines separate cells for better readability.
    • Annotations: For small matrices (≤16 experts), similarity values are displayed as text annotations on each cell.
  • Interpretation:
    • Diagonal pattern (good): If the matrix shows high values (near 1.0) only on the diagonal and low values (near 0.0) off-diagonal, experts are orthogonal and diverse. This is the desired behavior - each expert specializes in a different direction.
    • High off-diagonal values (bad): If off-diagonal entries are high (e.g., > 0.5), it indicates that multiple experts have similar principal directions. This suggests "Expert Collapse" - experts are redundant and not utilizing their full capacity.
    • Block patterns (partial collapse): If there are blocks of high similarity (e.g., experts 0-3 are similar to each other, experts 4-7 are similar to each other), it indicates partial collapse where groups of experts are redundant.
    • Mean off-diagonal similarity: A useful summary statistic. Values near 0 indicate good orthogonality. Values > 0.3 suggest significant redundancy. Values > 0.7 indicate severe expert collapse.
  • Statistics Reported:
    • Mean off-diagonal similarity: Average cosine similarity between different experts (excludes diagonal). Lower is better.
    • Mean absolute off-diagonal similarity: Average absolute value of off-diagonal similarities (ignores sign). Lower is better.
    • Max off-diagonal similarity: Maximum similarity between any two different experts. If this is high (> 0.7), at least two experts are very similar.
    • Min off-diagonal similarity: Minimum similarity (can be negative if experts point in opposite directions).
    • Std off-diagonal similarity: Standard deviation of off-diagonal similarities. Higher values indicate more variability in expert relationships.
  • Why This Matters:
    • Expert Diversity: MoE models are designed to have diverse experts that specialize in different aspects of the input. If experts collapse, the model is not utilizing its full capacity.
    • Efficiency: Redundant experts waste model parameters and computation. Orthogonal experts maximize the model's representational capacity.
    • Training Health: Expert collapse can indicate training issues (e.g., insufficient load balancing, poor routing, or optimization problems).
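The two method steps above can be sketched in code. This is a minimal sketch assuming NumPy and a list of 2-D expert weight matrices; note that SVD singular-vector signs are arbitrary, so individual signed similarities are only meaningful up to that ambiguity.

```python
import numpy as np

def expert_similarity_matrix(expert_weights):
    """Cosine similarity between experts' top singular vectors.

    expert_weights: list of [d_out, d_in] weight matrices, one per expert.
    Returns the [n_experts, n_experts] similarity matrix and summary stats.
    """
    tops = []
    for W in expert_weights:
        # Step 1: top right singular vector v1 = V[:, 0] (= Vt[0]) is the
        # expert's principal direction; it is already unit-norm.
        _, _, vt = np.linalg.svd(W, full_matrices=False)
        tops.append(vt[0])
    V1 = np.stack(tops)                       # [n_experts, d_in]
    # Step 2: pairwise dot products of unit vectors = cosine similarities.
    sim = V1 @ V1.T
    off = sim[~np.eye(len(sim), dtype=bool)]  # off-diagonal entries only
    stats = {"mean": float(off.mean()), "mean_abs": float(np.abs(off).mean()),
             "max": float(off.max()), "min": float(off.min()),
             "std": float(off.std())}
    return sim, stats
```

The heatmap in the figure is `sim` rendered with a diverging colormap; the reported statistics are the entries of `stats`.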

Inter-Expert Orthogonality - L1

Statistics:

  • Mean off-diagonal similarity: 0.3233
  • Mean absolute off-diagonal similarity: 0.3306
  • Max off-diagonal similarity: 0.8950

Inter-Expert Orthogonality - L9

Statistics:

  • Mean off-diagonal similarity: -0.0458
  • Mean absolute off-diagonal similarity: 0.4108
  • Max off-diagonal similarity: 0.6760

Inter-Expert Orthogonality - L11

Statistics:

  • Mean off-diagonal similarity: 0.1829
  • Mean absolute off-diagonal similarity: 0.3728
  • Max off-diagonal similarity: 0.6411

Inter-Expert Orthogonality - L21

Statistics:

  • Mean off-diagonal similarity: 0.2060
  • Mean absolute off-diagonal similarity: 0.3981
  • Max off-diagonal similarity: 0.7257

Inter-Expert Orthogonality - L30

Statistics:

  • Mean off-diagonal similarity: 0.1581
  • Mean absolute off-diagonal similarity: 0.4320
  • Max off-diagonal similarity: 0.6343

Unique Identification Analysis (Comparison)

This section compares unique identification metrics across all runs. These metrics test whether alignment is strong enough to uniquely identify experts, not just above chance.

⚠️ Could not generate unique identification comparison plots: 'numpy.ndarray' object has no attribute 'axis'

Per-Layer Ambiguity Score Analysis

Purpose: The Ambiguity Score combines multiple metrics (argmax accuracy, alignment margin, z-score) to identify layers with potential routing instability and load imbalance. Higher scores indicate more ambiguous layers.

Formula: AmbiguityScore = α·(1 - ArgmaxAccuracy) + β·(1 - NormalizedMargin) + γ·(1 - NormalizedZScore)

Interpretation: Layers ranked highest by ambiguity score are hypothesized to be at higher risk of routing instability and load imbalance.
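To make the weighting concrete, the formula can be written as a small function. This is a sketch; the default weights α = 0.4, β = 0.3, γ = 0.3 are those stated in the methodology later in this report, and the inputs are assumed to already be normalized to [0, 1].

```python
def ambiguity_score(argmax_acc, norm_margin, norm_z,
                    alpha=0.4, beta=0.3, gamma=0.3):
    """AmbiguityScore = α·(1-ArgmaxAccuracy) + β·(1-NormalizedMargin)
    + γ·(1-NormalizedZScore), with weights summing to 1."""
    return (alpha * (1 - argmax_acc)
            + beta * (1 - norm_margin)
            + gamma * (1 - norm_z))

# A layer with perfect accuracy, maximal margin, and maximal z-score
# scores 0 (fully unambiguous); a layer failing all three scores ~1.
low = ambiguity_score(1.0, 1.0, 1.0)   # -> 0.0
high = ambiguity_score(0.0, 0.0, 0.0)
```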

Ambiguity Score Rankings (sorted by score, highest ambiguity first):

Rank  Layer  Ambiguity Score  k*    Argmax Accuracy  Alignment Margin  Z-Score
1     0      0.9500           4096  0.125            0.000000          0.00
2     1      0.3115           16    1.000            0.062931          2.04
3     11     0.2291           512   1.000            0.146158          2.15
4     9      0.0975           1024  1.000            0.261166          2.46
5     30     0.0725           128   1.000            0.278289          2.55
6     21     0.0000           64    1.000            0.357489          2.60

Ambiguity Score Visualization

Per-Layer Ambiguity Score Analysis: This analysis combines multiple metrics to identify layers with potential routing instability and load imbalance.
  • Purpose: The Ambiguity Score quantifies how "ambiguous" a layer is in terms of router-expert alignment. Higher scores indicate layers where routing decisions may be less clear, potentially leading to instability or load imbalance.
  • Method:
    • Step 1 - Select k*: For each layer, select a fixed k* value. Options include:
      • max_margin: Use the k value where alignment margin is maximal (default, recommended)
      • fixed_128, fixed_256, fixed_512: Use a fixed k value across all layers
    • Step 2 - Extract Metrics at k*: For each layer, extract:
      • Argmax Accuracy: Fraction of routers where correct expert has maximum alignment
      • Alignment Margin: Mean difference between correct expert's alignment and next-best expert's alignment
      • Z-Score: Mean z-score versus shuffled router-expert pairings
    • Step 3 - Normalize Across Layers: Normalize margins and z-scores to [0, 1] range across all layers for fair comparison.
    • Step 4 - Compute Ambiguity Score: Weighted combination:
      • Formula: AmbiguityScore = α·(1 - ArgmaxAccuracy) + β·(1 - NormalizedMargin) + γ·(1 - NormalizedZScore)
      • Default weights: α = 0.4, β = 0.3, γ = 0.3 (sum to 1.0)
      • Interpretation: Each component measures a different aspect of ambiguity:
        • (1 - ArgmaxAccuracy): How often routers fail to identify the correct expert (higher = more ambiguous)
        • (1 - NormalizedMargin): How small the separation is between correct and incorrect experts (higher = more ambiguous)
        • (1 - NormalizedZScore): How weak the statistical significance is (higher = more ambiguous)
    • Step 5 - Rank Layers: Sort layers by ambiguity score (descending). Highest scores indicate most ambiguous layers.
  • Visualization:
    • Plot 1 - Ambiguity Score by Layer: Bar chart showing ambiguity score for each layer. Higher bars indicate more ambiguous layers.
    • Plot 2 - Components: Stacked or grouped bar chart showing the three components (1 - ArgmaxAccuracy, 1 - NormalizedMargin, 1 - NormalizedZScore) for each layer.
    • Plot 3 - Argmax Accuracy by Layer: Line plot showing argmax accuracy across layers. Lower values indicate more ambiguity.
    • Plot 4 - Alignment Margin by Layer: Line plot showing alignment margin across layers. Lower (or negative) values indicate more ambiguity.
  • Interpretation:
    • High Ambiguity Score (> 0.7): Layer has weak unique identification, small margins, and/or low z-scores. These layers are at high risk of routing instability and load imbalance.
    • Medium Ambiguity Score (0.4-0.7): Layer has moderate ambiguity. Some routing decisions may be unclear, but not critically unstable.
    • Low Ambiguity Score (< 0.4): Layer has strong unique identification, large margins, and high z-scores. These layers have clear routing decisions and are less likely to have stability issues.
    • Ranking: Layers ranked highest (top of the list) are the most ambiguous and should be prioritized for investigation or intervention.
  • Use Cases:
    • Identify Problem Layers: Quickly identify which layers have the most ambiguous routing, helping prioritize debugging or optimization efforts.
    • Compare Architectures: Compare ambiguity scores across different model architectures or training configurations.
    • Monitor Training: Track ambiguity scores during training to detect when layers become more ambiguous (potential sign of training issues).
    • Load Balancing: Layers with high ambiguity scores may benefit from load balancing interventions or routing regularization.
  • Limitations:
    • The score depends on the choice of k* and weights (α, β, γ). Different choices may yield different rankings.
    • Normalization across layers assumes all layers should be compared on the same scale, which may not always be appropriate.
    • High ambiguity doesn't necessarily mean the layer is "bad" - it may be intentionally ambiguous for certain tasks.
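Steps 3-5 above can be sketched end to end. This is a sketch under the assumption that "normalize across layers" means min-max normalization of margins and z-scores; with that assumption it approximately reproduces the rankings in the table earlier in this section from the per-layer metrics at k*.

```python
import numpy as np

def rank_layers_by_ambiguity(metrics, alpha=0.4, beta=0.3, gamma=0.3):
    """Steps 3-5: normalize across layers, score, and rank.

    metrics: dict layer -> (argmax_accuracy, margin, z_score) at k*.
    Returns (layer, score) pairs sorted from most to least ambiguous.
    """
    layers = sorted(metrics)
    acc = np.array([metrics[l][0] for l in layers])
    margin = np.array([metrics[l][1] for l in layers])
    z = np.array([metrics[l][2] for l in layers])

    def minmax(x):  # Step 3: normalize to [0, 1] across layers
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    # Step 4: weighted combination of the three ambiguity components.
    score = (alpha * (1 - acc) + beta * (1 - minmax(margin))
             + gamma * (1 - minmax(z)))
    order = np.argsort(-score)  # Step 5: rank descending (most ambiguous first)
    return [(layers[i], float(score[i])) for i in order]
```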
Ambiguity Score Analysis
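The weighted combination described above can be sketched as follows. The equal weights and the min-max normalization across layers are illustrative assumptions (the report's actual α, β, γ and normalization may differ), and the three-layer inputs are made-up values, not data from this report:

```python
import numpy as np

def ambiguity_scores(argmax_acc, margin, z_score, alpha=1/3, beta=1/3, gamma=1/3):
    """Per-layer ambiguity = alpha*(1 - ArgmaxAccuracy)
    + beta*(1 - NormalizedMargin) + gamma*(1 - NormalizedZScore).

    Inputs are 1-D arrays with one entry per layer, evaluated at the
    chosen k*. Margin and z-score are min-max normalized across layers.
    """
    def min_max(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    acc = np.asarray(argmax_acc, dtype=float)       # already in [1/n_experts, 1]
    nm = min_max(np.asarray(margin, dtype=float))   # NormalizedMargin
    nz = min_max(np.asarray(z_score, dtype=float))  # NormalizedZScore
    return alpha * (1 - acc) + beta * (1 - nm) + gamma * (1 - nz)

# Hypothetical values for three layers (argmax accuracy, margin, z-score):
scores = ambiguity_scores([0.50, 0.95, 0.80], [-0.02, 0.06, 0.03], [0.7, 2.5, 1.9])
ranking = np.argsort(scores)[::-1]   # most ambiguous layer first
```

Layers at the top of `ranking` are the ones the analysis flags for investigation first.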

2. Individual Analysis - Run 1

Setup and Configuration

Summary Statistics (averaged across experts)

k align delta_vs_shuffle z_vs_shuffle effect_over_random cos_squared argmax_accuracy alignment_margin
1 0.005770 0.002528 0.684055 0.005526 0.005770 0.500000 -0.001939
2 0.007577 0.001943 0.400250 0.007089 0.000000 0.250000 -0.005173
4 0.010832 0.001343 0.189122 0.009855 0.000000 0.125000 -0.008038
8 0.016802 -0.007680 -0.556915 0.014849 0.000000 0.000000 -0.026486
16 0.027283 -0.024356 -1.033393 0.023376 0.000000 0.000000 -0.054445
32 0.051706 -0.043072 -1.131536 0.043893 0.000000 0.000000 -0.086332
64 0.117375 -0.040200 -0.718772 0.101750 0.000000 0.125000 -0.095110
128 0.218082 -0.043899 -0.648009 0.186832 0.000000 0.125000 -0.111610
256 0.414622 -0.028932 -0.422043 0.352122 0.000000 0.250000 -0.088238
512 0.637770 -0.013315 -0.271164 0.512770 0.000000 0.125000 -0.057387
1024 0.821602 0.007298 0.219326 0.571602 0.000000 0.125000 -0.031759
2048 0.929245 0.005412 0.370229 0.429245 0.000000 0.250000 -0.014026
4096 1.000000 0.000000 0.000000 0.000000 0.000000 0.125000 0.000000
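As a concrete sketch of how the align and z_vs_shuffle columns can be computed: align is the projection energy of a router vector onto an expert's top-k right-singular subspace, and the z-score compares it against alignments with shuffled (mismatched) experts. Shapes, the number of shuffles, and the random weights below are hypothetical; the report's pipeline may differ:

```python
import numpy as np

def projection_energy(r, V_k):
    """align(r, Expert) = ||V_k^T r||^2 / ||r||^2: the fraction of the
    router vector's energy inside the expert's top-k singular subspace."""
    coeffs = V_k.T @ r
    return float(coeffs @ coeffs) / float(r @ r)

def z_vs_shuffle(align_true, shuffled):
    """Standard z-score of the true alignment against the shuffled baseline."""
    mu, sigma = np.mean(shuffled), np.std(shuffled)
    return float((align_true - mu) / sigma) if sigma > 0 else 0.0

# Toy setup with hypothetical shapes (d_model=64, expert matrices 256x64, k=4).
rng = np.random.default_rng(0)
d, k = 64, 4
W = rng.normal(size=(256, d))                    # the "correct" expert's weights
Vt = np.linalg.svd(W, full_matrices=False)[2]    # right singular vectors (rows)
r = rng.normal(size=d)                           # this expert's router vector
align = projection_energy(r, Vt[:k].T)

# Shuffled baseline: pair the same router with other (here random) experts.
shuffled = [
    projection_energy(r, np.linalg.svd(rng.normal(size=(256, d)),
                                       full_matrices=False)[2][:k].T)
    for _ in range(50)
]
z = z_vs_shuffle(align, shuffled)
```

At k equal to the full dimension the projection energy is 1.0 for any router, which is why the k=4096 rows in the tables show align = 1 and zero deltas.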

Cos²(θ) Alignment (k=1)

Mean cos²(θ): 0.005770

Max cos²(θ): 0.011631

Min cos²(θ): 0.000310

Std cos²(θ): 0.004499

Per-expert cos²(θ) values:

Expert  cos²(θ)  align
0 0.004563 0.004563
1 0.000310 0.000310
2 0.011379 0.011379
3 0.011631 0.011631
4 0.002967 0.002967
5 0.009454 0.009454
6 0.001051 0.001051
7 0.004804 0.004804

Detailed Results by K Value

K = 1:

K = 2:

K = 4:

K = 8:

K = 16:

K = 32:

K = 64:

K = 128:

K = 256:

K = 512:

K = 1024:

K = 2048:

K = 4096:

Unique Identification Summary

Interpretation:

Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.

Unique Identification Analysis

Purpose: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. While shuffled-baseline tests rule out random router-expert pairing, argmax and margin tests assess whether routing vectors encode expert identity in a separable and discriminative manner.

Key Questions:

Unique Identification Analysis

Unique Identification Analysis: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. It goes beyond shuffled-baseline tests to assess whether routing vectors encode expert identity in a separable and discriminative manner.
  • Argmax Accuracy vs k: Fraction of router vectors where the correct expert achieves the maximum alignment among all experts.
    • Formula: For each router vector r_i (assigned to expert i), compute alignment with ALL experts: align(r_i, Expert_j) for all j. Argmax accuracy = (1/n) · Σᵢ [argmax_j align(r_i, Expert_j) == i], where [·] is 1 if true, 0 otherwise.
    • Interpretation: Measures whether the correct expert-expert pairing can be uniquely identified from the alignment matrix. Value of 1.0 means perfect identification (every router's correct expert has the highest alignment). Value of 1/n_experts (e.g., 0.125 for 8 experts) means random guessing.
    • Computation: (1) For each router vector r_i, compute alignment with all experts' principal subspaces, forming a row of the alignment matrix, (2) Find which expert has maximum alignment: argmax_j align(r_i, Expert_j), (3) Check if argmax equals the correct expert i, (4) Average across all routers to get accuracy.
    • Range: [1/n_experts, 1.0]. Value of 1.0 indicates perfect unique identification. Value near 1/n_experts indicates alignment is not discriminative enough to identify experts uniquely.
    • Comparison to shuffled baseline: While shuffled baseline tests whether alignment is above chance, argmax accuracy tests whether alignment is strong enough to uniquely identify the correct expert from all possible experts.
  • Alignment Margin vs k: Mean difference between correct expert's alignment and the next-best expert's alignment.
    • Formula: For each router vector r_i: margin_i = align(r_i, Expert_i) - max_{j≠i} align(r_i, Expert_j). Mean margin = (1/n) · Σᵢ margin_i.
    • Interpretation: Measures the separation between correct and incorrect expert alignments. Positive margin means correct expert has higher alignment than all others (unique identification possible). Negative margin means another expert has higher alignment (misidentification).
    • Computation: (1) For each router r_i, compute alignments with all experts, (2) Get correct alignment: align_correct = align(r_i, Expert_i), (3) Get maximum among other experts: align_max_other = max_{j≠i} align(r_i, Expert_j), (4) Compute margin = align_correct - align_max_other, (5) Average across all routers.
    • Range: Can be negative or positive. Positive values indicate correct expert is best (unique identification). Negative values indicate another expert is better (confusion). Larger positive margins indicate stronger discriminative power.
    • Relationship to argmax accuracy: Per router, a positive margin means that router's argmax is correct; if every router's margin is positive, argmax accuracy = 1.0, and each router with a negative margin pushes accuracy below 1.0. The mean margin quantifies the strength of separation beyond what the binary argmax test captures.
  • Alignment Matrix: Full [n_experts × n_experts] matrix where entry (i, j) is the alignment of router vector i with expert j's principal subspace.
    • Formula: Matrix[i, j] = align(r_i, Expert_j) = projection energy of router i onto expert j's top-k singular subspace.
    • Interpretation: Shows the complete alignment landscape. Diagonal entries (i, i) are correct pairings. Off-diagonal entries show how well routers align with "wrong" experts. For unique identification, diagonal should be the maximum in each row.
    • Computation: For each router-expert pair (i, j), compute alignment using the same projection energy formula as the main analysis.
    • What to look for: Strong diagonal pattern (diagonal entries are highest in each row) indicates unique identification. Weak diagonal or strong off-diagonal entries indicate confusion between experts.
  • Margin Distribution: Histogram of alignment margins across all router vectors for different k values.
    • Interpretation: Shows the distribution of discriminative power. Most routers with positive margins indicate good unique identification. Many routers with negative margins indicate frequent misidentification.
    • What to look for: Distribution shifted to the right (positive values) indicates strong unique identification. Distribution centered near zero or shifted left indicates weak or no unique identification.

Interpretation Summary:

  • Argmax accuracy near 1.0 + positive margins: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a separable and discriminative manner.
  • Argmax accuracy above random but < 1.0: Alignment is above chance but not strong enough for perfect unique identification. Some routers may be confused with other experts.
  • Argmax accuracy near random baseline: Alignment is not discriminative enough to identify experts uniquely, even though it may be above the shuffled baseline (non-random but not uniquely identifying).
  • Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement.
Unique Identification Analysis
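A minimal sketch of the argmax-accuracy and margin computations from a full alignment matrix (the 3-expert matrix below is a made-up toy, not data from this report):

```python
import numpy as np

def unique_identification(A):
    """A[i, j] = align(r_i, Expert_j); returns (argmax_accuracy, mean_margin)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    correct = np.diag(A).copy()
    others = A.copy()
    np.fill_diagonal(others, -np.inf)        # mask the correct pairings
    best_other = others.max(axis=1)          # strongest competing expert per row
    argmax_acc = float(np.mean(A.argmax(axis=1) == np.arange(n)))
    mean_margin = float(np.mean(correct - best_other))
    return argmax_acc, mean_margin

# Toy 3-expert matrix: router 2 aligns more with expert 0 than with its own expert,
# so accuracy is 2/3 and that router contributes a negative margin.
A = np.array([[0.9, 0.1, 0.2],
              [0.3, 0.8, 0.1],
              [0.7, 0.2, 0.5]])
acc, margin = unique_identification(A)
```

A strong diagonal in `A` drives both numbers up; a single confused router lowers accuracy by 1/n while its negative margin pulls the mean margin down.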

Inter-Expert Orthogonality Analysis

Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.

Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.

Interpretation:

Orthogonality Statistics:

Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.

Inter-Expert Orthogonality Heatmap (k=2)

Inter-Expert Orthogonality: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
  • Method:
    • Step 1 - Extract Principal Directions: For each expert in the layer, extract its top singular vector (v₁ = V[:, 0]) from the SVD decomposition. This vector represents the expert's "main direction" or principal component.
    • Step 2 - Compute Similarity Matrix: Calculate the cosine similarity between every pair of experts' top singular vectors. The result is a [n_experts, n_experts] matrix where entry (i, j) is the cosine similarity between expert i and expert j's principal directions.
    • Formula: similarity[i, j] = (v₁ᵢ · v₁ⱼ) / (||v₁ᵢ|| · ||v₁ⱼ||) = v₁ᵢ · v₁ⱼ (since vectors are normalized), where v₁ᵢ is the top singular vector of expert i.
  • Visualization:
    • Heatmap: The similarity matrix is displayed as a heatmap with expert indices on both axes.
    • Colormap: Uses 'coolwarm' diverging colormap showing signed cosine similarity: blue (-1 = opposite directions) → white (0 = orthogonal) → red (+1 = same direction). The diagonal always shows 1.0 (experts are perfectly similar to themselves).
    • Grid: Grid lines separate cells for better readability.
    • Annotations: For small matrices (≤16 experts), similarity values are displayed as text annotations on each cell.
  • Interpretation:
    • Diagonal pattern (good): If the matrix shows high values (near 1.0) only on the diagonal and low values (near 0.0) off-diagonal, experts are orthogonal and diverse. This is the desired behavior - each expert specializes in a different direction.
    • High off-diagonal values (bad): If off-diagonal entries are high (e.g., > 0.5), it indicates that multiple experts have similar principal directions. This suggests "Expert Collapse" - experts are redundant and not utilizing their full capacity.
    • Block patterns (partial collapse): If there are blocks of high similarity (e.g., experts 0-3 are similar to each other, experts 4-7 are similar to each other), it indicates partial collapse where groups of experts are redundant.
    • Mean off-diagonal similarity: A useful summary statistic. Values near 0 indicate good orthogonality. Values > 0.3 suggest significant redundancy. Values > 0.7 indicate severe expert collapse.
  • Statistics Reported:
    • Mean off-diagonal similarity: Average cosine similarity between different experts (excludes diagonal). Lower is better.
    • Mean absolute off-diagonal similarity: Average absolute value of off-diagonal similarities (ignores sign). Lower is better.
    • Max off-diagonal similarity: Maximum similarity between any two different experts. If this is high (> 0.7), at least two experts are very similar.
    • Min off-diagonal similarity: Minimum similarity (can be negative if experts point in opposite directions).
    • Std off-diagonal similarity: Standard deviation of off-diagonal similarities. Higher values indicate more variability in expert relationships.
  • Why This Matters:
    • Expert Diversity: MoE models are designed to have diverse experts that specialize in different aspects of the input. If experts collapse, the model is not utilizing its full capacity.
    • Efficiency: Redundant experts waste model parameters and computation. Orthogonal experts maximize the model's representational capacity.
    • Training Health: Expert collapse can indicate training issues (e.g., insufficient load balancing, poor routing, or optimization problems).
Inter-Expert Orthogonality
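The similarity matrix and its off-diagonal statistics can be sketched as below. The toy rank-1 experts are constructed so their principal directions are exactly the standard basis vectors; note that singular-vector signs are arbitrary, which is one reason the mean absolute similarity is reported alongside the signed mean:

```python
import numpy as np

def expert_similarity_stats(experts):
    """experts: list of per-expert weight matrices from one layer.

    Returns the [n, n] cosine-similarity matrix of top right-singular
    vectors, plus the off-diagonal summary statistics from the report.
    """
    # Top right-singular vector v1 of each expert (unit norm by construction).
    V1 = np.stack([np.linalg.svd(W, full_matrices=False)[2][0] for W in experts])
    sim = V1 @ V1.T                                  # cosine similarity
    off = sim[~np.eye(len(experts), dtype=bool)]     # exclude self-similarity
    return sim, {
        "mean": off.mean(), "mean_abs": np.abs(off).mean(),
        "max": off.max(), "min": off.min(), "std": off.std(),
    }

# Toy rank-1 experts whose principal directions are e0, e1, e2: off-diagonal
# similarities are ~0, i.e. perfect orthogonality and no expert collapse.
experts = [10 * np.outer(np.ones(4), e) for e in np.eye(3)]
sim, stats = expert_similarity_stats(experts)
```

Plotting `sim` with a diverging colormap (e.g. `plt.imshow(sim, cmap="coolwarm", vmin=-1, vmax=1)`) reproduces the heatmap described above.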

Inter-Expert Orthogonality Comparison Across k Values

Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.

Inter-Expert Orthogonality Comparison
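For k > 1, one standard way to compare top-k subspaces is the mean squared cosine of the principal angles between them, ||Vᵢ Vⱼᵀ||²_F / k. This sketch uses that measure as an assumed generalization; the report's exact k > 1 similarity is not specified in this excerpt:

```python
import numpy as np

def subspace_overlap(Wi, Wj, k):
    """Overlap of two experts' top-k right-singular subspaces,
    ||Vi_k @ Vj_k.T||_F^2 / k: 1.0 for identical subspaces, 0.0 for
    orthogonal ones (mean squared cosine of the principal angles)."""
    Vi = np.linalg.svd(Wi, full_matrices=False)[2][:k]   # k x d_model
    Vj = np.linalg.svd(Wj, full_matrices=False)[2][:k]
    return float(np.sum((Vi @ Vj.T) ** 2) / k)

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
same = subspace_overlap(W, W, 4)          # identical subspaces -> 1.0

# Experts confined to disjoint input dimensions -> orthogonal subspaces.
Wa = np.zeros((8, 16)); Wa[:, :4] = rng.normal(size=(8, 4))
Wb = np.zeros((8, 16)); Wb[:, 4:8] = rng.normal(size=(8, 4))
disjoint = subspace_overlap(Wa, Wb, 4)    # -> 0.0
```

Sweeping k and averaging the off-diagonal overlaps per layer gives one curve per k value, matching the comparison-across-k figure.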

Complete Analysis Plots

Comprehensive visualization of all metrics for this run.

Complete Analysis Visualization

See "Plot Explanations" section at the top of this report for detailed information about this plot.

Complete Analysis

3. Individual Analysis - Run 2

Setup and Configuration

Summary Statistics (averaged across experts)

k align delta_vs_shuffle z_vs_shuffle effect_over_random cos_squared argmax_accuracy alignment_margin
1 0.057551 0.047232 2.554473 0.057307 0.057551 1.000000 0.049831
2 0.059381 0.047484 2.482403 0.058893 0.000000 1.000000 0.050548
4 0.068402 0.054134 2.553551 0.067426 0.000000 1.000000 0.056529
8 0.075794 0.054138 2.426632 0.073841 0.000000 1.000000 0.054771
16 0.104669 0.068705 2.044184 0.100762 0.000000 1.000000 0.062931
32 0.139227 0.069573 1.228228 0.131414 0.000000 0.750000 0.056749
64 0.188318 0.063255 0.907765 0.172693 0.000000 0.250000 0.038525
128 0.264706 0.052687 0.590516 0.233456 0.000000 0.125000 0.011865
256 0.416913 0.053419 0.453827 0.354413 0.000000 0.125000 -0.001926
512 0.611965 0.060646 0.445669 0.486965 0.000000 0.125000 0.000992
1024 0.767452 0.057432 0.421641 0.517452 0.000000 0.250000 -0.000392
2048 0.885113 0.037467 0.398855 0.385113 0.000000 0.250000 0.004784
4096 1.000000 0.000000 0.000000 0.000000 0.000000 0.125000 0.000000

Cos²(θ) Alignment (k=1)

Mean cos²(θ): 0.057551

Max cos²(θ): 0.070422

Min cos²(θ): 0.042068

Std cos²(θ): 0.011621

Per-expert cos²(θ) values:

Expert  cos²(θ)  align
0 0.054994 0.054994
1 0.070422 0.070422
2 0.059095 0.059095
3 0.042068 0.042068
4 0.068530 0.068530
5 0.053046 0.053046
6 0.070022 0.070022
7 0.042234 0.042234

Detailed Results by K Value

K = 1:

K = 2:

K = 4:

K = 8:

K = 16:

K = 32:

K = 64:

K = 128:

K = 256:

K = 512:

K = 1024:

K = 2048:

K = 4096:

Unique Identification Summary

Interpretation:

Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.

Unique Identification Analysis

Purpose: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. While shuffled-baseline tests rule out random router-expert pairing, argmax and margin tests assess whether routing vectors encode expert identity in a separable and discriminative manner.

Key Questions:

Unique Identification Analysis

See "Plot Explanations" section at the top of this report for detailed information about this plot.
Unique Identification Analysis

Inter-Expert Orthogonality Analysis

Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.

Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.

Interpretation:

Orthogonality Statistics:

Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.

Inter-Expert Orthogonality Heatmap (k=2)

See "Plot Explanations" section at the top of this report for detailed information about this plot.
Inter-Expert Orthogonality

Inter-Expert Orthogonality Comparison Across k Values

Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.

Inter-Expert Orthogonality Comparison

Complete Analysis Plots

Comprehensive visualization of all metrics for this run.

Complete Analysis Visualization

See "Plot Explanations" section at the top of this report for detailed information about this plot.

Complete Analysis

4. Individual Analysis - Run 3

Setup and Configuration

Summary Statistics (averaged across experts)

k align delta_vs_shuffle z_vs_shuffle effect_over_random cos_squared argmax_accuracy alignment_margin
1 0.064961 0.051988 2.218826 0.064717 0.064961 0.875000 0.051164
2 0.074739 0.058293 2.284018 0.074251 0.000000 1.000000 0.058322
4 0.082144 0.062073 2.388455 0.081167 0.000000 1.000000 0.062144
8 0.091984 0.066795 2.488270 0.090031 0.000000 1.000000 0.066244
16 0.111207 0.074492 2.354253 0.107300 0.000000 1.000000 0.073237
32 0.136211 0.085793 2.295637 0.128399 0.000000 1.000000 0.083369
64 0.172688 0.101731 2.336120 0.157063 0.000000 1.000000 0.101138
128 0.219534 0.118886 2.213562 0.188284 0.000000 1.000000 0.116449
256 0.284807 0.144395 2.246814 0.222307 0.000000 1.000000 0.137058
512 0.372220 0.169235 2.153636 0.247220 0.000000 1.000000 0.146158
1024 0.493685 0.183328 1.938368 0.243685 0.000000 1.000000 0.129647
2048 0.676953 0.169473 1.859634 0.176953 0.000000 1.000000 0.093153
4096 1.000000 0.000000 0.000000 0.000000 0.000000 0.125000 0.000000

Cos²(θ) Alignment (k=1)

Mean cos²(θ): 0.064961

Max cos²(θ): 0.097719

Min cos²(θ): 0.002639

Std cos²(θ): 0.029249

Per-expert cos²(θ) values:

Expert  cos²(θ)  align
0 0.073076 0.073076
1 0.090944 0.090944
2 0.068927 0.068927
3 0.002639 0.002639
4 0.097719 0.097719
5 0.062334 0.062334
6 0.073216 0.073216
7 0.050834 0.050834

Detailed Results by K Value

K = 1:

K = 2:

K = 4:

K = 8:

K = 16:

K = 32:

K = 64:

K = 128:

K = 256:

K = 512:

K = 1024:

K = 2048:

K = 4096:

Unique Identification Summary

Interpretation:

Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.

Unique Identification Analysis

Purpose: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. While shuffled-baseline tests rule out random router-expert pairing, argmax and margin tests assess whether routing vectors encode expert identity in a separable and discriminative manner.

Key Questions:

Unique Identification Analysis

See "Plot Explanations" section at the top of this report for detailed information about this plot.
Unique Identification Analysis

Inter-Expert Orthogonality Analysis

Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.

Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.

Interpretation:

Orthogonality Statistics:

Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.

Inter-Expert Orthogonality Heatmap (k=2)

Inter-Expert Orthogonality: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
  • Method:
    • Step 1 - Extract Principal Directions: For each expert in the layer, extract its top singular vector (v₁ = V[:, 0]) from the SVD decomposition. This vector represents the expert's "main direction" or principal component.
    • Step 2 - Compute Similarity Matrix: Calculate the cosine similarity between every pair of experts' top singular vectors. The result is a [n_experts, n_experts] matrix where entry (i, j) is the cosine similarity between expert i and expert j's principal directions.
    • Formula: similarity[i, j] = (v₁ᵢ · v₁ⱼ) / (||v₁ᵢ|| · ||v₁ⱼ||) = v₁ᵢ · v₁ⱼ (since vectors are normalized), where v₁ᵢ is the top singular vector of expert i.
  • Visualization:
    • Heatmap: The similarity matrix is displayed as a heatmap with expert indices on both axes.
    • Colormap: Uses 'coolwarm' diverging colormap showing signed cosine similarity: blue (-1 = opposite directions) → white (0 = orthogonal) → red (+1 = same direction). The diagonal always shows 1.0 (experts are perfectly similar to themselves).
    • Grid: Grid lines separate cells for better readability.
    • Annotations: For small matrices (≤16 experts), similarity values are displayed as text annotations on each cell.
  • Interpretation:
    • Diagonal pattern (good): If the matrix shows high values (near 1.0) only on the diagonal and low values (near 0.0) off-diagonal, experts are orthogonal and diverse. This is the desired behavior - each expert specializes in a different direction.
    • High off-diagonal values (bad): If off-diagonal entries are high (e.g., > 0.5), it indicates that multiple experts have similar principal directions. This suggests "Expert Collapse" - experts are redundant and not utilizing their full capacity.
    • Block patterns (partial collapse): If there are blocks of high similarity (e.g., experts 0-3 are similar to each other, experts 4-7 are similar to each other), it indicates partial collapse where groups of experts are redundant.
    • Mean off-diagonal similarity: A useful summary statistic. Values near 0 indicate good orthogonality. Values > 0.3 suggest significant redundancy. Values > 0.7 indicate severe expert collapse.
  • Statistics Reported:
    • Mean off-diagonal similarity: Average cosine similarity between different experts (excludes diagonal). Lower is better.
    • Mean absolute off-diagonal similarity: Average absolute value of off-diagonal similarities (ignores sign). Lower is better.
    • Max off-diagonal similarity: Maximum similarity between any two different experts. If this is high (> 0.7), at least two experts are very similar.
    • Min off-diagonal similarity: Minimum similarity (can be negative if experts point in opposite directions).
    • Std off-diagonal similarity: Standard deviation of off-diagonal similarities. Higher values indicate more variability in expert relationships.
  • Why This Matters:
    • Expert Diversity: MoE models are designed to have diverse experts that specialize in different aspects of the input. If experts collapse, the model is not utilizing its full capacity.
    • Efficiency: Redundant experts waste model parameters and computation. Orthogonal experts maximize the model's representational capacity.
    • Training Health: Expert collapse can indicate training issues (e.g., insufficient load balancing, poor routing, or optimization problems).
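The Method steps above (top singular vector per expert, pairwise cosines, off-diagonal statistics) amount to a short NumPy routine. A sketch with hypothetical names, not the report's implementation:

```python
import numpy as np

def expert_similarity_stats(experts):
    """Cosine-similarity matrix between experts' top singular vectors,
    plus the off-diagonal statistics listed above."""
    # First row of Vt is each expert's top right-singular vector v1;
    # v1 is unit-norm by construction, so dot products are already cosines.
    V1 = np.stack([np.linalg.svd(W, full_matrices=False)[2][0] for W in experts])
    S = V1 @ V1.T                                 # [n_experts, n_experts]
    off = S[~np.eye(len(experts), dtype=bool)]    # exclude self-similarity
    stats = {
        "mean": off.mean(),
        "mean_abs": np.abs(off).mean(),
        "max": off.max(),
        "min": off.min(),
        "std": off.std(),
    }
    return S, stats

# Hypothetical setup: 8 experts with hidden dimension 32.
rng = np.random.default_rng(0)
S, stats = expert_similarity_stats([rng.normal(size=(32, 32)) for _ in range(8)])
```

Because each v₁ is unit-norm, the diagonal of S is exactly 1.0, matching the heatmap's self-similarity entries.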
Inter-Expert Orthogonality

Inter-Expert Orthogonality Comparison Across k Values

Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.

Inter-Expert Orthogonality Comparison
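The report does not state how the k-dependent comparison is computed. One plausible formulation, shown purely as an assumption, measures the normalized overlap between two experts' top-k subspaces, ‖VᵢᵀVⱼ‖²_F / k, which is 1 for identical subspaces and near 0 for orthogonal ones:

```python
import numpy as np

def subspace_overlap(experts, k):
    """Normalized overlap between experts' top-k right-singular subspaces:
    overlap[i, j] = ||V_i^T V_j||_F^2 / k, in [0, 1]. This formula is an
    illustrative assumption, not the report's stated definition."""
    # Columns of each V are the top-k orthonormal input directions.
    Vs = [np.linalg.svd(W, full_matrices=False)[2][:k].T for W in experts]
    n = len(Vs)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = np.linalg.norm(Vs[i].T @ Vs[j], "fro") ** 2 / k
    return S

# Hypothetical setup: 8 experts with hidden dimension 32.
rng = np.random.default_rng(0)
overlap = subspace_overlap([rng.normal(size=(32, 32)) for _ in range(8)], k=4)
```

With k=1 this reduces to the squared cosine between top singular vectors, so it extends the heatmap's measure to larger subspaces.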

Complete Analysis Plots

Comprehensive visualization of all metrics for this run.

Complete Analysis Visualization

See "Plot Explanations" section at the top of this report for detailed information about this plot.

Complete Analysis

5. Individual Analysis - Run 4

Setup and Configuration

Summary Statistics (averaged across experts)

k align delta_vs_shuffle z_vs_shuffle effect_over_random cos_squared argmax_accuracy alignment_margin
1 0.086661 0.072986 2.512674 0.086417 0.086661 1.000000 0.080441
2 0.088815 0.075093 2.719778 0.088327 0.000000 1.000000 0.080075
4 0.102974 0.083705 2.541455 0.101998 0.000000 1.000000 0.089314
8 0.118629 0.091889 2.453338 0.116676 0.000000 1.000000 0.098466
16 0.142135 0.105862 2.694624 0.138229 0.000000 1.000000 0.110616
32 0.174238 0.119605 2.677794 0.166426 0.000000 1.000000 0.123853
64 0.217419 0.131549 2.434596 0.201794 0.000000 1.000000 0.139225
128 0.285893 0.167746 2.627440 0.254643 0.000000 1.000000 0.174368
256 0.375532 0.204085 2.376817 0.313032 0.000000 1.000000 0.215717
512 0.476727 0.248076 2.649691 0.351727 0.000000 1.000000 0.250286
1024 0.602968 0.262075 2.457562 0.352968 0.000000 1.000000 0.261166
2048 0.763862 0.215525 2.429688 0.263862 0.000000 1.000000 0.206376
4096 1.000000 0.000000 0.000000 0.000000 0.000000 0.125000 0.000000

Note: cos_squared is reported for k=1 only (the zeros in later rows are placeholders). At k=4096 the subspace spans the full space, so every alignment equals 1.0, the margin vanishes, and argmax accuracy falls to the random baseline of 1/8 = 0.125.

Cos²(θ) Alignment (k=1)

Mean cos²(θ): 0.086661

Max cos²(θ): 0.102524

Min cos²(θ): 0.059184

Std cos²(θ): 0.016729

Per-expert cos²(θ) values:

Expert  cos²(θ)  align
0 0.059184 0.059184
1 0.093662 0.093662
2 0.098208 0.098208
3 0.102524 0.102524
4 0.102472 0.102471
5 0.087545 0.087545
6 0.085803 0.085803
7 0.063891 0.063891
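The cos²(θ) and align columns coincide because, at k=1, the projection energy reduces to the squared cosine between the router vector and the expert's top singular vector. A minimal check with hypothetical unit vectors:

```python
import numpy as np

# Hypothetical vectors illustrating the k=1 identity, not the report's data.
rng = np.random.default_rng(0)
r = rng.normal(size=64);  r /= np.linalg.norm(r)    # router vector
v1 = rng.normal(size=64); v1 /= np.linalg.norm(v1)  # expert's top singular vector

cos_sq = np.dot(r, v1) ** 2                  # cos²(θ)
energy_k1 = np.sum((v1[None, :] @ r) ** 2)   # projection energy with k=1
assert np.isclose(cos_sq, energy_k1)
```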

Detailed Results by K Value

K = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 (details summarized in the table above).

Unique Identification Summary

Interpretation:

Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.

Unique Identification Analysis

Purpose: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. While shuffled-baseline tests rule out random router-expert pairing, argmax and margin tests assess whether routing vectors encode expert identity in a separable and discriminative manner.

Key Questions:

Unique Identification Analysis

See "Plot Explanations" section at the top of this report for detailed information about these plots.
Unique Identification Analysis

Inter-Expert Orthogonality Analysis

Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.

Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.

Interpretation:

Orthogonality Statistics:

Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.

Inter-Expert Orthogonality Heatmap (k=2)

See "Plot Explanations" section at the top of this report for detailed information about this plot.
Inter-Expert Orthogonality

Inter-Expert Orthogonality Comparison Across k Values

Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.

Inter-Expert Orthogonality Comparison

Complete Analysis Plots

Comprehensive visualization of all metrics for this run.

Complete Analysis Visualization

See "Plot Explanations" section at the top of this report for detailed information about this plot.

Complete Analysis

6. Individual Analysis - Run 5

Setup and Configuration

Summary Statistics (averaged across experts)

k align delta_vs_shuffle z_vs_shuffle effect_over_random cos_squared argmax_accuracy alignment_margin
1 0.044943 0.037256 2.469307 0.044699 0.044943 1.000000 0.040451
2 0.284024 0.241433 2.583244 0.283536 0.000000 1.000000 0.267528
4 0.318368 0.266162 2.487106 0.317391 0.000000 1.000000 0.300490
8 0.339890 0.288258 2.652657 0.337937 0.000000 1.000000 0.319115
16 0.362707 0.299835 2.477195 0.358801 0.000000 1.000000 0.337860
32 0.383622 0.316346 2.569243 0.375809 0.000000 1.000000 0.352929
64 0.401119 0.323436 2.602394 0.385494 0.000000 1.000000 0.357489
128 0.421271 0.322301 2.600011 0.390021 0.000000 1.000000 0.355564
256 0.450908 0.312535 2.582455 0.388408 0.000000 1.000000 0.338534
512 0.491796 0.282973 2.661499 0.366796 0.000000 1.000000 0.296593
1024 0.560068 0.214405 2.610082 0.310068 0.000000 1.000000 0.215217
2048 0.684269 0.095039 2.301127 0.184269 0.000000 1.000000 0.082996
4096 1.000000 0.000000 0.000000 0.000000 0.000000 0.125000 0.000000

Cos²(θ) Alignment (k=1)

Mean cos²(θ): 0.044943

Max cos²(θ): 0.069426

Min cos²(θ): 0.021023

Std cos²(θ): 0.014521

Per-expert cos²(θ) values:

Expert  cos²(θ)  align
0 0.049431 0.049431
1 0.021023 0.021023
2 0.069426 0.069426
3 0.041228 0.041228
4 0.042395 0.042395
5 0.054745 0.054745
6 0.032556 0.032556
7 0.048737 0.048737

Detailed Results by K Value

K = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 (details summarized in the table above).

Unique Identification Summary

Interpretation:

Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.

Unique Identification Analysis

Purpose: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. While shuffled-baseline tests rule out random router-expert pairing, argmax and margin tests assess whether routing vectors encode expert identity in a separable and discriminative manner.

Key Questions:

Unique Identification Analysis

See "Plot Explanations" section at the top of this report for detailed information about these plots.
Unique Identification Analysis

Inter-Expert Orthogonality Analysis

Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.

Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.

Interpretation:

Orthogonality Statistics:

Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.

Inter-Expert Orthogonality Heatmap (k=2)

See "Plot Explanations" section at the top of this report for detailed information about this plot.
Inter-Expert Orthogonality

Inter-Expert Orthogonality Comparison Across k Values

Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.

Inter-Expert Orthogonality Comparison

Complete Analysis Plots

Comprehensive visualization of all metrics for this run.

Complete Analysis Visualization

See "Plot Explanations" section at the top of this report for detailed information about this plot.

Complete Analysis

7. Individual Analysis - Run 6

Setup and Configuration

Summary Statistics (averaged across experts)

k align delta_vs_shuffle z_vs_shuffle effect_over_random cos_squared argmax_accuracy alignment_margin
1 0.023197 0.016761 1.491834 0.022952 0.023197 0.500000 0.011800
2 0.259938 0.214391 2.386393 0.259449 0.000000 1.000000 0.226243
4 0.286469 0.237247 2.607388 0.285492 0.000000 1.000000 0.248632
8 0.307733 0.242406 2.331280 0.305780 0.000000 1.000000 0.258187
16 0.332480 0.256799 2.505764 0.328573 0.000000 1.000000 0.260972
32 0.367138 0.268525 2.648007 0.359326 0.000000 1.000000 0.268582
64 0.408700 0.270629 2.431572 0.393075 0.000000 1.000000 0.273820
128 0.453305 0.279122 2.549925 0.422055 0.000000 1.000000 0.278289
256 0.491464 0.273296 2.560864 0.428964 0.000000 1.000000 0.268457
512 0.530967 0.249536 2.353941 0.405967 0.000000 1.000000 0.242936
1024 0.591075 0.211292 2.178432 0.341075 0.000000 1.000000 0.195782
2048 0.697854 0.151148 2.182067 0.197854 0.000000 1.000000 0.115009
4096 1.000000 0.000000 0.000000 0.000000 0.000000 0.125000 0.000000

Cos²(θ) Alignment (k=1)

Mean cos²(θ): 0.023197

Max cos²(θ): 0.063815

Min cos²(θ): 0.000086

Std cos²(θ): 0.025389

Per-expert cos²(θ) values:

Expert  cos²(θ)  align
0 0.000086 0.000086
1 0.031873 0.031873
2 0.006196 0.006196
3 0.015753 0.015753
4 0.008216 0.008216
5 0.001565 0.001565
6 0.063815 0.063815
7 0.058069 0.058069

Detailed Results by K Value

K = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 (details summarized in the table above).

Unique Identification Summary

Interpretation:

Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.

Unique Identification Analysis

Purpose: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. While shuffled-baseline tests rule out random router-expert pairing, argmax and margin tests assess whether routing vectors encode expert identity in a separable and discriminative manner.

Key Questions:

Unique Identification Analysis

Unique Identification Analysis: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. It goes beyond shuffled-baseline tests to assess whether routing vectors encode expert identity in a separable and discriminative manner.
  • Argmax Accuracy vs k: Fraction of router vectors where the correct expert achieves the maximum alignment among all experts.
    • Formula: For each router vector r_i (assigned to expert i), compute alignment with ALL experts: align(r_i, Expert_j) for all j. Argmax accuracy = (1/n) · Σᵢ [argmax_j align(r_i, Expert_j) == i], where [·] is 1 if true, 0 otherwise.
    • Interpretation: Measures whether the correct router-expert pairing can be uniquely identified from the alignment matrix. A value of 1.0 means perfect identification (every router's correct expert has the highest alignment); a value of 1/n_experts (e.g., 0.125 for 8 experts) means random guessing.
    • Computation: (1) For each router vector r_i, compute alignment with all experts' principal subspaces, forming a row of the alignment matrix, (2) Find which expert has maximum alignment: argmax_j align(r_i, Expert_j), (3) Check if argmax equals the correct expert i, (4) Average across all routers to get accuracy.
    • Range: [1/n_experts, 1.0]. Value of 1.0 indicates perfect unique identification. Value near 1/n_experts indicates alignment is not discriminative enough to identify experts uniquely.
    • Comparison to shuffled baseline: While shuffled baseline tests whether alignment is above chance, argmax accuracy tests whether alignment is strong enough to uniquely identify the correct expert from all possible experts.
  • Alignment Margin vs k: Mean difference between correct expert's alignment and the next-best expert's alignment.
    • Formula: For each router vector r_i: margin_i = align(r_i, Expert_i) - max_{j≠i} align(r_i, Expert_j). Mean margin = (1/n) · Σᵢ margin_i.
    • Interpretation: Measures the separation between correct and incorrect expert alignments. Positive margin means correct expert has higher alignment than all others (unique identification possible). Negative margin means another expert has higher alignment (misidentification).
    • Computation: (1) For each router r_i, compute alignments with all experts, (2) Get correct alignment: align_correct = align(r_i, Expert_i), (3) Get maximum among other experts: align_max_other = max_{j≠i} align(r_i, Expert_j), (4) Compute margin = align_correct - align_max_other, (5) Average across all routers.
    • Range: Can be negative or positive. Positive values indicate correct expert is best (unique identification). Negative values indicate another expert is better (confusion). Larger positive margins indicate stronger discriminative power.
    • Relationship to argmax accuracy: When every per-router margin is positive, argmax accuracy = 1.0; each router with a negative margin is misidentified and lowers accuracy below 1.0. The margin additionally quantifies the strength of separation even when the argmax is correct.
  • Alignment Matrix: Full [n_experts × n_experts] matrix where entry (i, j) is the alignment of router vector i with expert j's principal subspace.
    • Formula: Matrix[i, j] = align(r_i, Expert_j) = projection energy of router i onto expert j's top-k singular subspace.
    • Interpretation: Shows the complete alignment landscape. Diagonal entries (i, i) are correct pairings. Off-diagonal entries show how well routers align with "wrong" experts. For unique identification, diagonal should be the maximum in each row.
    • Computation: For each router-expert pair (i, j), compute alignment using the same projection energy formula as the main analysis.
    • What to look for: Strong diagonal pattern (diagonal entries are highest in each row) indicates unique identification. Weak diagonal or strong off-diagonal entries indicate confusion between experts.
  • Margin Distribution: Histogram of alignment margins across all router vectors for different k values.
    • Interpretation: Shows the distribution of discriminative power. Most routers with positive margins indicate good unique identification. Many routers with negative margins indicate frequent misidentification.
    • What to look for: Distribution shifted to the right (positive values) indicates strong unique identification. Distribution centered near zero or shifted left indicates weak or no unique identification.
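The alignment matrix, argmax accuracy, and margins described above can be sketched as follows. The expert bases and router vectors are random stand-ins, and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, d, k = 8, 32, 2

# Hypothetical top-k right singular bases ([k, d], orthonormal rows) and routers.
bases = [np.linalg.svd(rng.standard_normal((64, d)), full_matrices=False)[2][:k]
         for _ in range(n_experts)]
routers = rng.standard_normal((n_experts, d))

# Alignment matrix: entry (i, j) = projection energy of router i onto expert j.
A = np.empty((n_experts, n_experts))
for i, r in enumerate(routers):
    for j, Vk in enumerate(bases):
        c = Vk @ r
        A[i, j] = (c @ c) / (r @ r)

# Argmax accuracy: fraction of routers whose own expert has the highest alignment.
argmax_accuracy = np.mean(A.argmax(axis=1) == np.arange(n_experts))

# Margin: correct alignment minus the best wrong-expert alignment, per router.
idx = np.arange(n_experts)
diag = A[idx, idx]
off = A.copy()
off[idx, idx] = -np.inf
margins = diag - off.max(axis=1)
mean_margin = margins.mean()
```

With random stand-in data the accuracy should hover near the 1/n_experts chance level; a trained model with discriminative routing would push it toward 1.0 with positive margins.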

Interpretation Summary:

  • Argmax accuracy near 1.0 + positive margins: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a separable and discriminative manner.
  • Argmax accuracy above random but < 1.0: Alignment is above chance but not strong enough for perfect unique identification. Some routers may be confused with other experts.
  • Argmax accuracy near random baseline: Alignment is not discriminative enough to identify experts uniquely, even though it may be above the shuffled baseline (non-random but not uniquely identifying).
  • Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement.
Unique Identification Analysis

Inter-Expert Orthogonality Analysis

Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.

Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.

Interpretation:

Orthogonality Statistics:

Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.

Inter-Expert Orthogonality Heatmap (k=2)

Inter-Expert Orthogonality: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
  • Method:
    • Step 1 - Extract Principal Directions: For each expert in the layer, extract its top singular vector (v₁ = V[:, 0]) from the SVD decomposition. This vector represents the expert's "main direction" or principal component.
    • Step 2 - Compute Similarity Matrix: Calculate the cosine similarity between every pair of experts' top singular vectors. The result is a [n_experts, n_experts] matrix where entry (i, j) is the cosine similarity between expert i and expert j's principal directions.
    • Formula: similarity[i, j] = (v₁ᵢ · v₁ⱼ) / (||v₁ᵢ|| · ||v₁ⱼ||) = v₁ᵢ · v₁ⱼ (since vectors are normalized), where v₁ᵢ is the top singular vector of expert i.
  • Visualization:
    • Heatmap: The similarity matrix is displayed as a heatmap with expert indices on both axes.
    • Colormap: Uses 'coolwarm' diverging colormap showing signed cosine similarity: blue (-1 = opposite directions) → white (0 = orthogonal) → red (+1 = same direction). The diagonal always shows 1.0 (experts are perfectly similar to themselves).
    • Grid: Grid lines separate cells for better readability.
    • Annotations: For small matrices (≤16 experts), similarity values are displayed as text annotations on each cell.
  • Interpretation:
    • Diagonal pattern (good): If the matrix shows high values (near 1.0) only on the diagonal and low values (near 0.0) off-diagonal, experts are orthogonal and diverse. This is the desired behavior - each expert specializes in a different direction.
    • High off-diagonal values (bad): If off-diagonal entries are high (e.g., > 0.5), it indicates that multiple experts have similar principal directions. This suggests "Expert Collapse" - experts are redundant and not utilizing their full capacity.
    • Block patterns (partial collapse): If there are blocks of high similarity (e.g., experts 0-3 are similar to each other, experts 4-7 are similar to each other), it indicates partial collapse where groups of experts are redundant.
    • Mean off-diagonal similarity: A useful summary statistic. Values near 0 indicate good orthogonality. Values > 0.3 suggest significant redundancy. Values > 0.7 indicate severe expert collapse.
  • Statistics Reported:
    • Mean off-diagonal similarity: Average cosine similarity between different experts (excludes diagonal). Lower is better.
    • Mean absolute off-diagonal similarity: Average absolute value of off-diagonal similarities (ignores sign). Lower is better.
    • Max off-diagonal similarity: Maximum similarity between any two different experts. If this is high (> 0.7), at least two experts are very similar.
    • Min off-diagonal similarity: Minimum similarity (can be negative if experts point in opposite directions).
    • Std off-diagonal similarity: Standard deviation of off-diagonal similarities. Higher values indicate more variability in expert relationships.
  • Why This Matters:
    • Expert Diversity: MoE models are designed to have diverse experts that specialize in different aspects of the input. If experts collapse, the model is not utilizing its full capacity.
    • Efficiency: Redundant experts waste model parameters and computation. Orthogonal experts maximize the model's representational capacity.
    • Training Health: Expert collapse can indicate training issues (e.g., insufficient load balancing, poor routing, or optimization problems).
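The similarity-matrix construction above (k=1 case) can be sketched in a few lines. The expert weights are random stand-ins and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_experts, d = 8, 32
experts = [rng.standard_normal((64, d)) for _ in range(n_experts)]

# v1 = top right singular vector of each expert (unit norm by construction),
# so the dot product of two rows is directly the cosine similarity.
V1 = np.stack([np.linalg.svd(W, full_matrices=False)[2][0] for W in experts])

S = V1 @ V1.T                              # [n_experts, n_experts] similarity matrix
off = S[~np.eye(n_experts, dtype=bool)]    # off-diagonal entries only

print(f"mean off-diag: {off.mean():+.3f}")
print(f"mean |off-diag|: {np.abs(off).mean():.3f}")
print(f"max off-diag: {off.max():+.3f}")
```

The diagonal of S is exactly 1.0, and for random high-dimensional experts the off-diagonal entries concentrate near 0; clustered high off-diagonal values would signal the expert collapse discussed above.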
Inter-Expert Orthogonality

Inter-Expert Orthogonality Comparison Across k Values

Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.

Inter-Expert Orthogonality Comparison

Complete Analysis Plots

Comprehensive visualization of all metrics for this run.

Complete Analysis Visualization

See "Plot Explanations" section at the top of this report for detailed information about this plot.

Complete Analysis

Generated on 2025-12-28 10:35:19